Change point estimation is often formulated as a search for the maximum of a gain function describing the improved fit when segmenting the data. Searching through all candidates requires $O(n)$ evaluations of the gain function for an interval with $n$ observations. If each evaluation is computationally demanding (e.g. in high-dimensional models), this can become infeasible. Instead, we propose optimistic search methods with $O(\log n)$ evaluations exploiting specific structure of the gain function. Towards a solid understanding of our strategy, we investigate in detail the $p$-dimensional Gaussian changing means setup, including high-dimensional scenarios. For some of our proposals, we prove asymptotic minimax optimality for detecting change points and derive their asymptotic localization rates. These rates are (up to a possible log factor) optimal for the univariate and multivariate scenarios, and are by far the fastest in the literature under the weakest possible detection condition on the signal-to-noise ratio in the high-dimensional scenario. Computationally, our proposed methodology has worst-case complexity $O(np)$, which can be improved to be sublinear in $n$ if some a priori knowledge of the length of the shortest segment is available. Our search strategies generalize far beyond the theoretically analyzed setup. We illustrate, as an example, massive computational speedup in change point detection for high-dimensional Gaussian graphical models.
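The narrowing step behind such a strategy can be sketched in a few lines. Below is a minimal, illustrative version assuming the gain curve is (approximately) unimodal; `cusum_gain` is the textbook univariate Gaussian CUSUM gain, and the ternary-style probe placement here is a simplification rather than the paper's exact search rules.

```python
def cusum_gain(x, s):
    """Gain of splitting x into x[:s], x[s:] for a univariate mean change."""
    n = len(x)
    m1 = sum(x[:s]) / s
    m2 = sum(x[s:]) / (n - s)
    return s * (n - s) / n * (m1 - m2) ** 2

def optimistic_search(gain, lo, hi):
    """Locate the argmax of gain on {lo, ..., hi} with O(log(hi - lo))
    evaluations, assuming the gain curve is unimodal."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if gain(m1) < gain(m2):
            lo = m1 + 1  # the maximum cannot lie at or left of m1
        else:
            hi = m2 - 1  # the maximum cannot lie at or right of m2
    return max(range(lo, hi + 1), key=gain)
```

Each iteration shrinks the candidate interval by a constant factor using two gain evaluations, which is what yields the $O(\log n)$ evaluation count for a single search.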
Data-driven interatomic potentials have emerged as a powerful class of surrogate models for {\it ab initio} potential energy surfaces that are able to reliably predict macroscopic properties with experimental accuracy. In generating accurate and transferable potentials, the most time-consuming and arguably most important task is generating the training set, which still requires significant expert user input. To accelerate this process, this work presents {\it hyperactive learning} (HAL), a framework for formulating an accelerated sampling algorithm specifically for the task of training database generation. The key idea is to start from a physically motivated sampler (e.g., molecular dynamics) and add a biasing term that drives the system towards high uncertainty and thus towards unseen training configurations. Building on this framework, we present general protocols for building training databases for alloys and polymers. For alloys, ACE potentials for AlSi10 are created by fitting to a minimal HAL-generated database containing 88 configurations (32 atoms each), with fast evaluation times of <100 microseconds/atom/cpu-core. These potentials are demonstrated to predict the melting temperature with excellent accuracy. For polymers, a HAL database is built using ACE that is able to determine, with experimental accuracy, the density of a long polyethylene glycol (PEG) polymer formed of 200 monomer units, while fitting only to small isolated PEG polymers with sizes ranging from 2 to 32 monomer units.
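The biasing idea can be illustrated with a committee-of-models uncertainty estimate. This is a hedged sketch, not HAL's actual implementation: the function names are hypothetical, and the real framework uses uncertainty measures tied to the fitted ACE models rather than a generic committee.

```python
import numpy as np

def hal_biased_energy(config, committee, tau=0.1):
    """HAL-style biased energy: committee mean minus tau times committee spread.

    `committee` is a list of surrogate potential-energy functions; their
    disagreement (standard deviation) serves as the uncertainty estimate.
    Subtracting tau * spread lowers the energy in high-uncertainty regions,
    pulling the dynamics towards unseen configurations.
    """
    preds = np.array([model(config) for model in committee])
    return preds.mean() - tau * preds.std()

def hal_select(trajectory, committee, threshold):
    """Flag configurations whose committee disagreement exceeds a threshold,
    i.e. candidates for ab initio labelling and addition to the database."""
    return [c for c in trajectory
            if np.std([m(c) for m in committee]) > threshold]
```

In a HAL-like loop, dynamics would be run on the biased surface and flagged configurations would be labelled and used to refit the potential before sampling continues.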
Density-based representations of atomic environments that are invariant under Euclidean symmetries have become a widely used tool in the machine learning of interatomic potentials, broader data-driven atomistic modelling, and the visualisation and analysis of materials datasets. The standard mechanism used to incorporate chemical element information is to create separate densities for each element and form tensor products between them. This leads to a steep scaling in the size of the representation as the number of elements increases. Graph neural networks, which do not explicitly use density representations, escape this scaling by mapping the chemical element information into a fixed-dimensional space in a learnable way. We recast this approach as tensor factorisation by exploiting the tensor structure of standard neighbour-density based descriptors. In doing so, we form compact tensor-reduced representations whose size does not depend on the number of chemical elements, but which remain systematically convergent and are therefore applicable to a wide range of data analysis and regression tasks.
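The element-embedding idea can be sketched concretely. In the toy version below (illustrative names and a Gaussian basis chosen for simplicity, not the paper's actual descriptor), each atom contributes to $k$ fixed channels weighted by a learnable embedding row `W[z]`, so the representation size is independent of how many chemical species appear.

```python
import numpy as np

def embedded_density(positions, elements, centers, W, sigma=0.5):
    """Element-embedded neighbour density.

    Rather than one density channel per chemical element (whose count grows
    with the number of species), each atom contributes to k fixed channels,
    weighted by its embedding row W[z].  Output shape: (n_centers, k),
    independent of the number of distinct elements present.
    """
    out = np.zeros((len(centers), W.shape[1]))
    for r, z in zip(positions, elements):
        # Gaussian basis evaluated at every expansion center
        phi = np.exp(-np.linalg.norm(centers - r, axis=1) ** 2 / (2 * sigma ** 2))
        out += np.outer(phi, W[z])
    return out
```

Summing embedded contributions before forming products is what replaces the per-element tensor products and yields the tensor-reduced scaling.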
Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, it has been shown that several message passing neural networks (MPNNs) surpass, in terms of accuracy, models built using other approaches. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages, leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just \emph{two}, resulting in a fast and highly parallelizable model that reaches or exceeds state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher-order messages leads to an improved steepness of the learning curves.
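The benefit of higher body-order messages can be caricatured with scalars. The sketch below is only a toy analogue (real MACE uses equivariant tensor contractions, not elementwise powers): $\nu$-fold products of a pooled two-body basis produce $(\nu+1)$-body terms, so a single layer already carries many-body information instead of requiring many stacked two-body layers.

```python
import numpy as np

def pooled_basis(neighbor_feats):
    """ACE-style 'A-basis': a sum over neighbours (a two-body quantity)."""
    return neighbor_feats.sum(axis=0)

def many_body_message(neighbor_feats, max_order=4):
    """Toy scalar analogue of a higher body-order message: nu-fold products
    of the pooled basis yield (nu+1)-body terms in one shot, so expressivity
    no longer hinges on stacking many message passing layers."""
    A = pooled_basis(neighbor_feats)
    return np.concatenate([A ** nu for nu in range(1, max_order)])
```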
Various applications of farm animal imaging are based on estimating the weights of certain body parts and cuts from CT images of the animals. In many cases, the complexity of the problem is increased by the enormous variability of postures in the CT images, since non-sedated, living animals are scanned. In this paper, we propose a general and robust method for estimating the weights of cuts and body parts from CT images of (possibly) living animals. We adapt atlas-based segmentation driven by elastic registration, together with joint feature and model selection for the regression component, to cope with the large number of features and the low number of samples. The proposed technique is evaluated and illustrated through a real application in a rabbit breeding program, showing R^2 scores higher than those of previous techniques and methods. The proposed technique is easily adaptable to similar problems; consequently, it is shared in an open-source software package for the benefit of the community.
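The regression setting described here, many features and few samples, is the classic regime where unregularized least squares breaks down. As a hedged illustration (the paper's actual joint feature and model selection procedure differs), a penalized fit evaluated with the R^2 score quoted in the abstract might look like:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: the penalty lam keeps the normal equations
    well-conditioned when features outnumber (or rival) samples."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def r2_score(y, y_hat):
    """Coefficient of determination, the metric reported in the abstract."""
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```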
Over the past 15 years, the segmentation of vessels in retinal images has become an intensively researched problem in medical imaging, with hundreds of algorithms published. One of the de facto benchmark datasets for vessel segmentation techniques is the DRIVE dataset. Since DRIVE contains a predefined split into training and test images, the published performance results of the various segmentation techniques should provide a reliable ranking of the algorithms. Covering more than 100 papers in the study, we performed a detailed numerical analysis of the consistency of the published performance scores. We found inconsistencies in the reported scores related to the use of the field of view (FOV), which has a significant impact on the performance scores. We attempted to eliminate the biases using numerical techniques to provide the most realistic picture possible. Based on the results, we formulate several findings, most notably: despite the well-defined test set, the majority of rankings in the published papers are based on non-comparable figures; and, in contrast to the near-perfect accuracy scores reported in the literature, the highest accuracy score achieved to date is 0.9582 in the FOV region, about 1% higher than that of the human annotator. The methods we developed for identifying and eliminating evaluation biases can be readily applied in other domains where similar problems may arise.
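Why the FOV matters is easy to demonstrate numerically. In the toy example below (hypothetical numbers, not DRIVE data), pixels outside the FOV are trivially correct background, so an accuracy computed over the whole image is inflated relative to one restricted to the FOV, making the two kinds of published scores non-comparable:

```python
import numpy as np

def accuracy(pred, gt, mask=None):
    """Pixel accuracy, optionally restricted to a region-of-interest mask."""
    if mask is not None:
        pred, gt = pred[mask], gt[mask]
    return float((pred == gt).mean())

# Toy "image": 50 FOV pixels (90% classified correctly) plus 50 pixels
# outside the FOV that are trivially correct background.
gt = np.zeros(100, dtype=int)
gt[:25] = 1                 # vessel pixels, all inside the FOV
pred = gt.copy()
pred[20:25] = 0             # five mistakes, all inside the FOV
fov = np.zeros(100, dtype=bool)
fov[:50] = True             # field of view covers the first 50 pixels
```

Here the whole-image accuracy is 0.95 while the FOV-restricted accuracy is 0.90; the same segmentation yields two different "published" scores depending on the evaluation region.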